Introduction: Special thanks to https://github.com/justmarkham for sharing the dataset and materials.
Occupations
Step 1. Import the necessary libraries
Step 2. Import the dataset from this address.
Step 3. Assign it to a variable called users
Step 4. Discover the mean age per occupation
Step 5. Discover the male ratio per occupation and sort it from most to least
Step 6. For each occupation, calculate the minimum and maximum ages
Step 7. For each combination of occupation and sex, calculate the mean age
Step 8. For each occupation present the percentage of women and men
import pandas as pd
import numpy as np
import seaborn as sns
# the raw file already contains a header row, so explicit column names are not needed
users = pd.read_csv(r'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', sep='|')
users.head()
| | user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |
Step 4. Discover the mean age per occupation
mean_age = users.groupby("occupation").age.mean()
mean_age
occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64
Step 5. Discover the male ratio per occupation and sort it from most to least
# gender counts within each occupation, normalized to percentages
all_genders = users.groupby("occupation")["gender"].value_counts(normalize=True) * 100
# convert to a DataFrame and rename the values column
df_gender = pd.DataFrame(all_genders)
df_gender.rename(columns={"gender": "percent"}, inplace=True)
# move the MultiIndex levels back into ordinary columns
index_gen = df_gender.reset_index(level=["occupation", "gender"])
index_gen.head()
| | occupation | gender | percent |
|---|---|---|---|
| 0 | administrator | M | 54.430380 |
| 1 | administrator | F | 45.569620 |
| 2 | artist | M | 53.571429 |
| 3 | artist | F | 46.428571 |
| 4 | doctor | M | 100.000000 |
male_ratio = index_gen[index_gen["gender"] == "M"].sort_values(by="percent", ascending=False)
male_ratio
| | occupation | gender | percent |
|---|---|---|---|
| 4 | doctor | M | 100.000000 |
| 7 | engineer | M | 97.014925 |
| 37 | technician | M | 96.296296 |
| 29 | retired | M | 92.857143 |
| 27 | programmer | M | 90.909091 |
| 11 | executive | M | 90.625000 |
| 33 | scientist | M | 90.322581 |
| 9 | entertainment | M | 88.888889 |
| 17 | lawyer | M | 83.333333 |
| 31 | salesman | M | 75.000000 |
| 5 | educator | M | 72.631579 |
| 35 | student | M | 69.387755 |
| 25 | other | M | 65.714286 |
| 21 | marketing | M | 61.538462 |
| 39 | writer | M | 57.777778 |
| 23 | none | M | 55.555556 |
| 0 | administrator | M | 54.430380 |
| 2 | artist | M | 53.571429 |
| 20 | librarian | M | 43.137255 |
| 14 | healthcare | M | 31.250000 |
| 16 | homemaker | M | 14.285714 |
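The chain above (value_counts, DataFrame conversion, rename, reset_index, filter) can be condensed. A minimal sketch on a hypothetical mini users frame, using `xs` to pull the male share straight out of the MultiIndex:

```python
import pandas as pd

# toy frame standing in for the MovieLens users data (hypothetical values)
users = pd.DataFrame({
    "occupation": ["doctor", "doctor", "artist", "artist", "artist"],
    "gender":     ["M",      "M",      "M",      "F",      "F"],
})

# share of each gender within each occupation, as percentages
pct = users.groupby("occupation")["gender"].value_counts(normalize=True) * 100

# keep only the male share and sort descending; occupations with no
# male rows are simply absent from the result
male_ratio = pct.xs("M", level="gender").sort_values(ascending=False)
print(male_ratio)
```

The result is a plain Series indexed by occupation, which is often all the exercise needs.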
Step 6. For each occupation, calculate the minimum and maximum ages
users.groupby("occupation").age.agg(['min', 'max'])
| | min | max |
|---|---|---|
| occupation | ||
| administrator | 21 | 70 |
| artist | 19 | 48 |
| doctor | 28 | 64 |
| educator | 23 | 63 |
| engineer | 22 | 70 |
| entertainment | 15 | 50 |
| executive | 22 | 69 |
| healthcare | 22 | 62 |
| homemaker | 20 | 50 |
| lawyer | 21 | 53 |
| librarian | 23 | 69 |
| marketing | 24 | 55 |
| none | 11 | 55 |
| other | 13 | 64 |
| programmer | 20 | 63 |
| retired | 51 | 73 |
| salesman | 18 | 66 |
| scientist | 23 | 55 |
| student | 7 | 42 |
| technician | 21 | 55 |
| writer | 18 | 60 |
Step 7. For each combination of occupation and sex, calculate the mean age
users.groupby(['occupation','gender']).age.mean()
occupation gender
administrator F 40.638889
M 37.162791
artist F 30.307692
M 32.333333
doctor M 43.571429
educator F 39.115385
M 43.101449
engineer F 29.500000
M 36.600000
entertainment F 31.000000
M 29.000000
executive F 44.000000
M 38.172414
healthcare F 39.818182
M 45.400000
homemaker F 34.166667
M 23.000000
lawyer F 39.500000
M 36.200000
librarian F 40.000000
M 40.000000
marketing F 37.200000
M 37.875000
none F 36.500000
M 18.600000
other F 35.472222
M 34.028986
programmer F 32.166667
M 33.216667
retired F 70.000000
M 62.538462
salesman F 27.000000
M 38.555556
scientist F 28.333333
M 36.321429
student F 20.750000
M 22.669118
technician F 38.000000
M 32.961538
writer F 37.631579
M 35.346154
Name: age, dtype: float64
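The long Series above can be pivoted so F and M become side-by-side columns with `unstack`. A sketch on hypothetical data:

```python
import pandas as pd

# hypothetical mini version of the users frame
users = pd.DataFrame({
    "occupation": ["artist", "artist", "artist", "doctor"],
    "gender":     ["F", "F", "M", "M"],
    "age":        [30, 32, 40, 50],
})

# same grouping as above, pivoted so each gender becomes a column;
# occupations with only one gender get NaN in the missing column
mean_age = users.groupby(["occupation", "gender"]).age.mean().unstack()
print(mean_age)
```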
Step 8. For each occupation present the percentage of women and men
index_gen
| | occupation | gender | percent |
|---|---|---|---|
| 0 | administrator | M | 54.430380 |
| 1 | administrator | F | 45.569620 |
| 2 | artist | M | 53.571429 |
| 3 | artist | F | 46.428571 |
| 4 | doctor | M | 100.000000 |
| 5 | educator | M | 72.631579 |
| 6 | educator | F | 27.368421 |
| 7 | engineer | M | 97.014925 |
| 8 | engineer | F | 2.985075 |
| 9 | entertainment | M | 88.888889 |
| 10 | entertainment | F | 11.111111 |
| 11 | executive | M | 90.625000 |
| 12 | executive | F | 9.375000 |
| 13 | healthcare | F | 68.750000 |
| 14 | healthcare | M | 31.250000 |
| 15 | homemaker | F | 85.714286 |
| 16 | homemaker | M | 14.285714 |
| 17 | lawyer | M | 83.333333 |
| 18 | lawyer | F | 16.666667 |
| 19 | librarian | F | 56.862745 |
| 20 | librarian | M | 43.137255 |
| 21 | marketing | M | 61.538462 |
| 22 | marketing | F | 38.461538 |
| 23 | none | M | 55.555556 |
| 24 | none | F | 44.444444 |
| 25 | other | M | 65.714286 |
| 26 | other | F | 34.285714 |
| 27 | programmer | M | 90.909091 |
| 28 | programmer | F | 9.090909 |
| 29 | retired | M | 92.857143 |
| 30 | retired | F | 7.142857 |
| 31 | salesman | M | 75.000000 |
| 32 | salesman | F | 25.000000 |
| 33 | scientist | M | 90.322581 |
| 34 | scientist | F | 9.677419 |
| 35 | student | M | 69.387755 |
| 36 | student | F | 30.612245 |
| 37 | technician | M | 96.296296 |
| 38 | technician | F | 3.703704 |
| 39 | writer | M | 57.777778 |
| 40 | writer | F | 42.222222 |
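`pd.crosstab` with `normalize="index"` computes the same occupation-by-gender percentages in a single call. A sketch with hypothetical rows:

```python
import pandas as pd

# hypothetical stand-in for the users frame
users = pd.DataFrame({
    "occupation": ["writer", "writer", "writer", "lawyer"],
    "gender":     ["M", "M", "F", "M"],
})

# normalize="index" makes each row sum to 1; scale to percentages
gender_pct = pd.crosstab(users["occupation"], users["gender"],
                         normalize="index") * 100
print(gender_pct)
```

Unlike the value_counts approach, occupations with a missing gender get an explicit 0 rather than an absent row.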
Euro Teams
Step 1. Import the necessary libraries
Step 2. Import the dataset from this address
Step 3. Assign it to a variable called euro12
import pandas as pd
import numpy as np
euro12 = pd.read_csv(r'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv')
euro12.head()
| | Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | Subs on | Subs off | Players Used |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 | 0 | 0 | 0 | ... | 13 | 81.3% | 41 | 62 | 2 | 9 | 0 | 9 | 9 | 16 |
| 1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 | 0 | 0 | 0 | ... | 9 | 60.1% | 53 | 73 | 8 | 7 | 0 | 11 | 11 | 19 |
| 2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 | 1 | 0 | 0 | ... | 10 | 66.7% | 25 | 38 | 8 | 4 | 0 | 7 | 7 | 15 |
| 3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 | 0 | 0 | 0 | ... | 22 | 88.1% | 43 | 45 | 6 | 5 | 0 | 11 | 11 | 16 |
| 4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 | 1 | 0 | 0 | ... | 6 | 54.6% | 36 | 51 | 5 | 6 | 0 | 11 | 11 | 19 |
5 rows × 35 columns
Step 4. Select only the Goals column
euro12['Goals']
0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64
Step 5. How many teams participated in Euro 2012?
euro12.shape[0]
16
Step 6. What is the number of columns in the dataset?
euro12.shape[1]
35
Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline
discipline = euro12[['Team','Yellow Cards','Red Cards']]
discipline
| | Team | Yellow Cards | Red Cards |
|---|---|---|---|
| 0 | Croatia | 9 | 0 |
| 1 | Czech Republic | 7 | 0 |
| 2 | Denmark | 4 | 0 |
| 3 | England | 5 | 0 |
| 4 | France | 6 | 0 |
| 5 | Germany | 4 | 0 |
| 6 | Greece | 9 | 1 |
| 7 | Italy | 16 | 0 |
| 8 | Netherlands | 5 | 0 |
| 9 | Poland | 7 | 1 |
| 10 | Portugal | 12 | 0 |
| 11 | Republic of Ireland | 6 | 1 |
| 12 | Russia | 6 | 0 |
| 13 | Spain | 11 | 0 |
| 14 | Sweden | 7 | 0 |
| 15 | Ukraine | 5 | 0 |
Step 8. Sort the teams by Red Cards, then by Yellow Cards
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending=False)
| | Team | Yellow Cards | Red Cards |
|---|---|---|---|
| 6 | Greece | 9 | 1 |
| 9 | Poland | 7 | 1 |
| 11 | Republic of Ireland | 6 | 1 |
| 7 | Italy | 16 | 0 |
| 10 | Portugal | 12 | 0 |
| 13 | Spain | 11 | 0 |
| 0 | Croatia | 9 | 0 |
| 1 | Czech Republic | 7 | 0 |
| 14 | Sweden | 7 | 0 |
| 4 | France | 6 | 0 |
| 12 | Russia | 6 | 0 |
| 3 | England | 5 | 0 |
| 8 | Netherlands | 5 | 0 |
| 15 | Ukraine | 5 | 0 |
| 2 | Denmark | 4 | 0 |
| 5 | Germany | 4 | 0 |
Step 9. Calculate the mean Yellow Cards given per Team
round(discipline['Yellow Cards'].mean(),2)
7.44
Step 10. Filter teams that scored more than 6 goals
euro12[euro12.Goals > 6]
| | Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | Subs on | Subs off | Players Used |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 | 15 | 15 | 17 |
| 13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 | 0 | 1 | 0 | ... | 15 | 93.8% | 102 | 83 | 19 | 11 | 0 | 17 | 17 | 18 |
2 rows × 35 columns
Step 11. Select the teams that start with G
euro12[euro12['Team'].str.startswith('G')]
| | Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | Subs on | Subs off | Players Used |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 | 15 | 15 | 17 |
| 6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 | 1 | 1 | 1 | ... | 13 | 65.1% | 67 | 48 | 12 | 9 | 1 | 12 | 12 | 20 |
2 rows × 35 columns
Step 12. Select the first 7 columns
euro12.iloc[:,0:7]
| | Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) |
|---|---|---|---|---|---|---|---|
| 0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 |
| 1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 |
| 2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 |
| 3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 |
| 4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 |
| 5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 |
| 6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 |
| 7 | Italy | 6 | 34 | 45 | 43.0% | 7.5% | 110 |
| 8 | Netherlands | 2 | 12 | 36 | 25.0% | 4.1% | 60 |
| 9 | Poland | 2 | 15 | 23 | 39.4% | 5.2% | 48 |
| 10 | Portugal | 6 | 22 | 42 | 34.3% | 9.3% | 82 |
| 11 | Republic of Ireland | 1 | 7 | 12 | 36.8% | 5.2% | 28 |
| 12 | Russia | 5 | 9 | 31 | 22.5% | 12.5% | 59 |
| 13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 |
| 14 | Sweden | 5 | 17 | 19 | 47.2% | 13.8% | 39 |
| 15 | Ukraine | 2 | 7 | 26 | 21.2% | 6.0% | 38 |
Step 13. Select all columns except the last 3
euro12[euro12.columns[0:-3]]
| | Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Clean Sheets | Blocks | Goals conceded | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 | 0 | 0 | 0 | ... | 0 | 10 | 3 | 13 | 81.3% | 41 | 62 | 2 | 9 | 0 |
| 1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 | 0 | 0 | 0 | ... | 1 | 10 | 6 | 9 | 60.1% | 53 | 73 | 8 | 7 | 0 |
| 2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 | 1 | 0 | 0 | ... | 1 | 10 | 5 | 10 | 66.7% | 25 | 38 | 8 | 4 | 0 |
| 3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 | 0 | 0 | 0 | ... | 2 | 29 | 3 | 22 | 88.1% | 43 | 45 | 6 | 5 | 0 |
| 4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 | 1 | 0 | 0 | ... | 1 | 7 | 5 | 6 | 54.6% | 36 | 51 | 5 | 6 | 0 |
| 5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 1 | 11 | 6 | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 |
| 6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 | 1 | 1 | 1 | ... | 1 | 23 | 7 | 13 | 65.1% | 67 | 48 | 12 | 9 | 1 |
| 7 | Italy | 6 | 34 | 45 | 43.0% | 7.5% | 110 | 2 | 0 | 0 | ... | 2 | 18 | 7 | 20 | 74.1% | 101 | 89 | 16 | 16 | 0 |
| 8 | Netherlands | 2 | 12 | 36 | 25.0% | 4.1% | 60 | 2 | 0 | 0 | ... | 0 | 9 | 5 | 12 | 70.6% | 35 | 30 | 3 | 5 | 0 |
| 9 | Poland | 2 | 15 | 23 | 39.4% | 5.2% | 48 | 0 | 0 | 0 | ... | 0 | 8 | 3 | 6 | 66.7% | 48 | 56 | 3 | 7 | 1 |
| 10 | Portugal | 6 | 22 | 42 | 34.3% | 9.3% | 82 | 6 | 0 | 0 | ... | 2 | 11 | 4 | 10 | 71.5% | 73 | 90 | 10 | 12 | 0 |
| 11 | Republic of Ireland | 1 | 7 | 12 | 36.8% | 5.2% | 28 | 0 | 0 | 0 | ... | 0 | 23 | 9 | 17 | 65.4% | 43 | 51 | 11 | 6 | 1 |
| 12 | Russia | 5 | 9 | 31 | 22.5% | 12.5% | 59 | 2 | 0 | 0 | ... | 0 | 8 | 3 | 10 | 77.0% | 34 | 43 | 4 | 6 | 0 |
| 13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 | 0 | 1 | 0 | ... | 5 | 8 | 1 | 15 | 93.8% | 102 | 83 | 19 | 11 | 0 |
| 14 | Sweden | 5 | 17 | 19 | 47.2% | 13.8% | 39 | 3 | 0 | 0 | ... | 1 | 12 | 5 | 8 | 61.6% | 35 | 51 | 7 | 7 | 0 |
| 15 | Ukraine | 2 | 7 | 26 | 21.2% | 6.0% | 38 | 0 | 0 | 0 | ... | 0 | 4 | 4 | 13 | 76.5% | 48 | 31 | 4 | 5 | 0 |
16 rows × 32 columns
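An equivalent, slightly terser selection uses `iloc` with a negative bound. A sketch on a hypothetical five-column frame:

```python
import pandas as pd

# small hypothetical frame; the idea carries over to euro12
df = pd.DataFrame({c: [0] for c in ["a", "b", "c", "d", "e"]})

# negative slice bounds work in iloc just as in plain Python lists
trimmed = df.iloc[:, :-3]
print(list(trimmed.columns))
```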
Step 14. Present only the Shooting Accuracy from England, Italy and Russia
sa = euro12.set_index('Team')
sa.loc[['England','Italy','Russia'],['Shooting Accuracy']]
| | Shooting Accuracy |
|---|---|
| Team | |
| England | 50.0% |
| Italy | 43.0% |
| Russia | 22.5% |
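The same selection works without reindexing by Team, using a boolean mask built with `isin`. A sketch on a hypothetical fragment of euro12:

```python
import pandas as pd

# hypothetical fragment of the euro12 frame
euro12 = pd.DataFrame({
    "Team": ["England", "France", "Italy", "Russia"],
    "Shooting Accuracy": ["50.0%", "37.9%", "43.0%", "22.5%"],
})

# isin builds a boolean mask, so the original index is left untouched
subset = euro12.loc[euro12["Team"].isin(["England", "Italy", "Russia"]),
                    ["Team", "Shooting Accuracy"]]
print(subset)
```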
Housing
Step 1. Import the necessary libraries
import pandas as pd
import numpy as np
import random
Step 2. Create 3 different Series, each of length 100, as follows:
The first a random number from 1 to 4
series1 = pd.Series(np.random.randint(1, high=5, size=100, dtype='l'))
The second a random number from 1 to 3
series2 = pd.Series(np.random.randint(1, high=4, size=100, dtype='l'))
The third a random number from 10,000 to 30,000
# note: high=300001 yields values up to 300,000; high=30001 would match the stated 10,000 to 30,000 range
series3 = pd.Series(np.random.randint(10000, high=300001, size=100, dtype='l'))
Step 3. Create a DataFrame by joining the Series by column
all_series = pd.concat([series1,series2,series3],axis=1)
all_series.head()
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 4 | 1 | 244444 |
| 1 | 2 | 1 | 83801 |
| 2 | 4 | 2 | 288413 |
| 3 | 1 | 3 | 190856 |
| 4 | 4 | 2 | 145935 |
Step 4. Change the name of the columns to bedrs, bathrs, price_sqr_meter
all_series.rename(columns={0: 'bedrs', 1: 'bathrs', 2: 'price_sqr_meter'}, inplace=True)
all_series.head()
| | bedrs | bathrs | price_sqr_meter |
|---|---|---|---|
| 0 | 4 | 1 | 244444 |
| 1 | 2 | 1 | 83801 |
| 2 | 4 | 2 | 288413 |
| 3 | 1 | 3 | 190856 |
| 4 | 4 | 2 | 145935 |
Step 5. Create a one-column DataFrame with the values of the 3 Series and assign it to 'bigcolumn'
#join the values using concat
one_col_df = pd.concat([series1,series2,series3],axis=0)
#convert table to a dataframe
bigcolumn = one_col_df.to_frame()
#rename the column
bigcolumn.rename(columns = {0:'All series'},inplace = True)
bigcolumn
| | All series |
|---|---|
| 0 | 4 |
| 1 | 2 |
| 2 | 4 |
| 3 | 1 |
| 4 | 4 |
| ... | ... |
| 95 | 266475 |
| 96 | 250411 |
| 97 | 111942 |
| 98 | 229190 |
| 99 | 39256 |
300 rows × 1 columns
Step 6. Oops, it seems the index only goes up to 99. Is that true?
#len(bigcolumn) or
bigcolumn.count()
All series    300
dtype: int64
Step 7. Reindex the DataFrame so it goes from 0 to 299
bigcolumn.reset_index(drop=True, inplace=True)
bigcolumn
| | All series |
|---|---|
| 0 | 4 |
| 1 | 2 |
| 2 | 4 |
| 3 | 1 |
| 4 | 4 |
| ... | ... |
| 295 | 266475 |
| 296 | 250411 |
| 297 | 111942 |
| 298 | 229190 |
| 299 | 39256 |
300 rows × 1 columns
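The reset_index step can be avoided entirely: `concat` accepts `ignore_index=True`, which renumbers the rows 0..n-1 in one go. A sketch with hypothetical short Series:

```python
import pandas as pd

s1 = pd.Series([1, 2])   # hypothetical stand-ins for series1..series3
s2 = pd.Series([3, 4])
s3 = pd.Series([5, 6])

# ignore_index=True renumbers the combined rows, so no reset_index is needed
bigcolumn = pd.concat([s1, s2, s3], ignore_index=True).to_frame("All series")
print(bigcolumn.index.tolist())
```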
Wind Statistics
The first three columns are year, month, and day. The remaining 12 columns are average windspeeds in knots at 12 locations in Ireland on that day.
Step 1. Import the necessary libraries
Step 2. Import the dataset from the attached file wind.txt
Step 3. Assign it to a variable called data and replace the first 3 columns by a proper datetime index.
import pandas as pd
import datetime
# parse_dates combines columns 0, 1, 2 and parses them as a single date column
txt_data = 'wind.txt'
data = pd.read_csv(txt_data, sep=r"\s+", parse_dates=[[0, 1, 2]])
data.head()
| | Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2061-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
| 1 | 2061-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
| 2 | 2061-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
| 3 | 2061-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
| 4 | 2061-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
Step 4. Year 2061? Do we really have data from this year? Create a function to fix it and apply it.
# two-digit years were parsed into the wrong century; shift any year past 1989 back 100 years
def fix_date(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)
# apply the function fix_date on the column and replace values
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_date)
data.head()
| | Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
| 1 | 1961-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
| 2 | 1961-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
| 3 | 1961-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
| 4 | 1961-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
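The same century fix can be done without a row-by-row apply, using `where` and a `DateOffset` (a sketch on hypothetical dates; `pd.DateOffset` arithmetic applies element-wise to datetime64 Series):

```python
import pandas as pd

# hypothetical dates with the same century problem as the wind data
dates = pd.Series(pd.to_datetime(["2061-01-01", "1978-12-31"]))

# keep dates with year <= 1989; otherwise subtract 100 years
fixed = dates.where(dates.dt.year <= 1989, dates - pd.DateOffset(years=100))
print(fixed.dt.year.tolist())
```

This also keeps the column as datetime64[ns], so the separate pd.to_datetime conversion in Step 5 becomes unnecessary.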
Step 5. Set the right dates as the index. Pay attention to the data type: it should be datetime64[ns].
# fix_date returned datetime.date objects, so convert back to datetime64[ns]
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])
# set the dates as the index
data = data.set_index('Yr_Mo_Dy')
data.head()
| | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | ||||||||||||
| 1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
| 1961-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
| 1961-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
| 1961-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
| 1961-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
Step 6. Compute how many values are missing for each location over the entire record. They should be ignored in all calculations below.
data.isnull().sum()
RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64
Step 7. Compute how many non-missing values there are in total.
data.notnull().sum()
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64
Step 8. Calculate the mean windspeeds of the windspeeds over all the locations and all the times. A single number for the entire dataset.
data.sum().sum() / data.notna().sum().sum()
10.227883764282181
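Note that `data.mean().mean()` would give a different number: it averages the per-location means, weighting every location equally regardless of how many values it is missing. The ratio above is instead equivalent to stacking all non-missing cells into one long Series. A small sketch with hypothetical values:

```python
import pandas as pd
import numpy as np

# hypothetical frame with one missing value
data = pd.DataFrame({"A": [1.0, 3.0], "B": [5.0, np.nan]})

overall = data.stack().mean()        # (1 + 3 + 5) / 3 = 3.0, mean over all cells
mean_of_means = data.mean().mean()   # (2 + 5) / 2 = 3.5, mean of column means

print(overall, mean_of_means)
```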
Step 9. Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days. A different set of numbers for each location.
loc_stats = data.describe(percentiles=[])
loc_stats
| | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6568.000000 | 6571.000000 | 6572.000000 | 6569.000000 | 6572.000000 | 6574.000000 | 6571.000000 | 6572.000000 | 6571.000000 | 6573.000000 | 6574.000000 | 6570.000000 |
| mean | 12.362987 | 10.644314 | 11.660526 | 6.306468 | 10.455834 | 7.092254 | 9.797343 | 8.495053 | 8.493590 | 8.707332 | 13.121007 | 15.599079 |
| std | 5.618413 | 5.267356 | 5.008450 | 3.605811 | 4.936125 | 3.968683 | 4.977555 | 4.499449 | 4.166872 | 4.503954 | 5.835037 | 6.699794 |
| min | 0.670000 | 0.210000 | 1.500000 | 0.000000 | 0.130000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.040000 | 0.130000 | 0.670000 |
| 50% | 11.710000 | 10.170000 | 10.920000 | 5.750000 | 9.960000 | 6.830000 | 9.210000 | 8.080000 | 8.170000 | 8.290000 | 12.500000 | 15.000000 |
| max | 35.800000 | 33.370000 | 33.840000 | 28.460000 | 37.540000 | 26.160000 | 30.370000 | 31.080000 | 25.880000 | 28.210000 | 42.380000 | 42.540000 |
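`describe()` also carries the count and a percentile row; `agg` with exactly the four requested statistics is a closer match to the prompt. A sketch on a hypothetical two-location frame:

```python
import pandas as pd
import numpy as np

# hypothetical two-location frame with a missing value
data = pd.DataFrame({"RPT": [1.0, 3.0, np.nan], "VAL": [2.0, 4.0, 6.0]})

# agg produces exactly the four requested rows, skipping NaNs like describe()
loc_stats = data.agg(["min", "max", "mean", "std"])
print(loc_stats)
```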
Step 10. Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day. A different set of numbers for each day.
#create the dataframe
day_stats = pd.DataFrame()
day_stats['min'] = data.min(axis = 1) #min
day_stats['max'] = data.max(axis = 1) #max
day_stats['mean'] = data.mean(axis = 1) #mean
day_stats['standard_deviations'] = data.std(axis = 1) #standard deviations
day_stats.head()
| | min | max | mean | standard_deviations |
|---|---|---|---|---|
| Yr_Mo_Dy | ||||
| 1961-01-01 | 9.29 | 18.50 | 13.018182 | 2.808875 |
| 1961-01-02 | 6.50 | 17.54 | 11.336364 | 3.188994 |
| 1961-01-03 | 6.17 | 18.50 | 11.641818 | 3.681912 |
| 1961-01-04 | 1.79 | 11.75 | 6.619167 | 3.198126 |
| 1961-01-05 | 6.17 | 13.33 | 10.630000 | 2.445356 |
Step 11. Find the average windspeed in January for each location. Treat January 1961 and January 1962 both as January.
data.loc[data.index.month == 1].mean()
RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64
Step 12. Downsample the record to a yearly frequency for each location.
data.groupby(data.index.to_period('A')).mean()
| | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | ||||||||||||
| 1961 | 12.299583 | 10.351796 | 11.362369 | 6.958227 | 10.881763 | 7.729726 | 9.733923 | 8.858788 | 8.647652 | 9.835577 | 13.502795 | 13.680773 |
| 1962 | 12.246923 | 10.110438 | 11.732712 | 6.960440 | 10.657918 | 7.393068 | 11.020712 | 8.793753 | 8.316822 | 9.676247 | 12.930685 | 14.323956 |
| 1963 | 12.813452 | 10.836986 | 12.541151 | 7.330055 | 11.724110 | 8.434712 | 11.075699 | 10.336548 | 8.903589 | 10.224438 | 13.638877 | 14.999014 |
| 1964 | 12.363661 | 10.920164 | 12.104372 | 6.787787 | 11.454481 | 7.570874 | 10.259153 | 9.467350 | 7.789016 | 10.207951 | 13.740546 | 14.910301 |
| 1965 | 12.451370 | 11.075534 | 11.848767 | 6.858466 | 11.024795 | 7.478110 | 10.618712 | 8.879918 | 7.907425 | 9.918082 | 12.964247 | 15.591644 |
| 1966 | 13.461973 | 11.557205 | 12.020630 | 7.345726 | 11.805041 | 7.793671 | 10.579808 | 8.835096 | 8.514438 | 9.768959 | 14.265836 | 16.307260 |
| 1967 | 12.737151 | 10.990986 | 11.739397 | 7.143425 | 11.630740 | 7.368164 | 10.652027 | 9.325616 | 8.645014 | 9.547425 | 14.774548 | 17.135945 |
| 1968 | 11.835628 | 10.468197 | 11.409754 | 6.477678 | 10.760765 | 6.067322 | 8.859180 | 8.255519 | 7.224945 | 7.832978 | 12.808634 | 15.017486 |
| 1969 | 11.166356 | 9.723699 | 10.902000 | 5.767973 | 9.873918 | 6.189973 | 8.564493 | 7.711397 | 7.924521 | 7.754384 | 12.621233 | 15.762904 |
| 1970 | 12.600329 | 10.726932 | 11.730247 | 6.217178 | 10.567370 | 7.609452 | 9.609890 | 8.334630 | 9.297616 | 8.289808 | 13.183644 | 16.456027 |
| 1971 | 11.273123 | 9.095178 | 11.088329 | 5.241507 | 9.440329 | 6.097151 | 8.385890 | 6.757315 | 7.915370 | 7.229753 | 12.208932 | 15.025233 |
| 1972 | 12.463962 | 10.561311 | 12.058333 | 5.929699 | 9.430410 | 6.358825 | 9.704508 | 7.680792 | 8.357295 | 7.515273 | 12.727377 | 15.028716 |
| 1973 | 11.828466 | 10.680493 | 10.680493 | 5.547863 | 9.640877 | 6.548740 | 8.482110 | 7.614274 | 8.245534 | 7.812411 | 12.169699 | 15.441096 |
| 1974 | 13.643096 | 11.811781 | 12.336356 | 6.427041 | 11.110986 | 6.809781 | 10.084603 | 9.896986 | 9.331753 | 8.736356 | 13.252959 | 16.947671 |
| 1975 | 12.008575 | 10.293836 | 11.564712 | 5.269096 | 9.190082 | 5.668521 | 8.562603 | 7.843836 | 8.797945 | 7.382822 | 12.631671 | 15.307863 |
| 1976 | 11.737842 | 10.203115 | 10.761230 | 5.109426 | 8.846339 | 6.311038 | 9.149126 | 7.146202 | 8.883716 | 7.883087 | 12.332377 | 15.471448 |
| 1977 | 13.099616 | 11.144493 | 12.627836 | 6.073945 | 10.003836 | 8.586438 | 11.523205 | 8.378384 | 9.098192 | 8.821616 | 13.459068 | 16.590849 |
| 1978 | 12.504356 | 11.044274 | 11.380000 | 6.082356 | 10.167233 | 7.650658 | 9.489342 | 8.800466 | 9.089753 | 8.301699 | 12.967397 | 16.771370 |
Step 13. Downsample the record to a monthly frequency for each location.
data.groupby(data.index.to_period('M')).mean()
| | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | ||||||||||||
| 1961-01 | 14.841333 | 11.988333 | 13.431613 | 7.736774 | 11.072759 | 8.588065 | 11.184839 | 9.245333 | 9.085806 | 10.107419 | 13.880968 | 14.703226 |
| 1961-02 | 16.269286 | 14.975357 | 14.441481 | 9.230741 | 13.852143 | 10.937500 | 11.890714 | 11.846071 | 11.821429 | 12.714286 | 18.583214 | 15.411786 |
| 1961-03 | 10.890000 | 11.296452 | 10.752903 | 7.284000 | 10.509355 | 8.866774 | 9.644194 | 9.829677 | 10.294138 | 11.251935 | 16.410968 | 15.720000 |
| 1961-04 | 10.722667 | 9.427667 | 9.998000 | 5.830667 | 8.435000 | 6.495000 | 6.925333 | 7.094667 | 7.342333 | 7.237000 | 11.147333 | 10.278333 |
| 1961-05 | 9.860968 | 8.850000 | 10.818065 | 5.905333 | 9.490323 | 6.574839 | 7.604000 | 8.177097 | 8.039355 | 8.499355 | 11.900323 | 12.011613 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1978-08 | 9.645161 | 8.259355 | 9.032258 | 4.502903 | 7.368065 | 5.935161 | 5.650323 | 5.417742 | 7.241290 | 5.536774 | 10.466774 | 12.054194 |
| 1978-09 | 10.913667 | 10.895000 | 10.635000 | 5.725000 | 10.372000 | 9.278333 | 10.790333 | 9.583000 | 10.069333 | 8.939000 | 15.680333 | 19.391333 |
| 1978-10 | 9.897742 | 8.670968 | 9.295806 | 4.721290 | 8.525161 | 6.774194 | 8.115484 | 7.337742 | 8.297742 | 8.243871 | 13.776774 | 17.150000 |
| 1978-11 | 16.151667 | 14.802667 | 13.508000 | 7.317333 | 11.475000 | 8.743000 | 11.492333 | 9.657333 | 10.701333 | 10.676000 | 17.404667 | 20.723000 |
| 1978-12 | 16.175484 | 13.748065 | 15.635161 | 7.094839 | 11.398710 | 9.241613 | 12.077419 | 10.194839 | 10.616774 | 11.028710 | 13.859677 | 21.371613 |
216 rows × 12 columns
Step 14. Downsample the record to a weekly frequency for each location.
data.groupby(data.index.to_period('W')).mean()
| | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | ||||||||||||
| 1960-12-26/1961-01-01 | 15.040000 | 14.960000 | 13.170000 | 9.290000 | NaN | 9.870000 | 13.670000 | 10.250000 | 10.830000 | 12.580000 | 18.500000 | 15.040000 |
| 1961-01-02/1961-01-08 | 13.541429 | 11.486667 | 10.487143 | 6.417143 | 9.474286 | 6.435714 | 11.061429 | 6.616667 | 8.434286 | 8.497143 | 12.481429 | 13.238571 |
| 1961-01-09/1961-01-15 | 12.468571 | 8.967143 | 11.958571 | 4.630000 | 7.351429 | 5.072857 | 7.535714 | 6.820000 | 5.712857 | 7.571429 | 11.125714 | 11.024286 |
| 1961-01-16/1961-01-22 | 13.204286 | 9.862857 | 12.982857 | 6.328571 | 8.966667 | 7.417143 | 9.257143 | 7.875714 | 7.145714 | 8.124286 | 9.821429 | 11.434286 |
| 1961-01-23/1961-01-29 | 19.880000 | 16.141429 | 18.225714 | 12.720000 | 17.432857 | 14.828571 | 15.528571 | 15.160000 | 14.480000 | 15.640000 | 20.930000 | 22.530000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1978-11-27/1978-12-03 | 14.934286 | 11.232857 | 13.941429 | 5.565714 | 10.215714 | 8.618571 | 9.642857 | 7.685714 | 9.011429 | 9.547143 | 11.835714 | 18.728571 |
| 1978-12-04/1978-12-10 | 20.740000 | 19.190000 | 17.034286 | 9.777143 | 15.287143 | 12.774286 | 14.437143 | 12.488571 | 13.870000 | 14.082857 | 18.517143 | 23.061429 |
| 1978-12-11/1978-12-17 | 16.758571 | 14.692857 | 14.987143 | 6.917143 | 11.397143 | 7.272857 | 10.208571 | 7.967143 | 9.168571 | 8.565714 | 11.102857 | 15.562857 |
| 1978-12-18/1978-12-24 | 11.155714 | 8.008571 | 13.172857 | 4.004286 | 7.825714 | 6.290000 | 7.798571 | 8.667143 | 7.151429 | 8.072857 | 11.845714 | 18.977143 |
| 1978-12-25/1978-12-31 | 14.951429 | 11.801429 | 16.035714 | 6.507143 | 9.660000 | 8.620000 | 13.708571 | 10.477143 | 10.868571 | 11.471429 | 12.947143 | 26.844286 |
940 rows × 12 columns
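Grouping by `to_period('W')` works; on a DatetimeIndex, `resample` expresses the same weekly downsampling more directly. A sketch on a hypothetical fortnight of data:

```python
import pandas as pd

# hypothetical fortnight of daily readings starting on a Sunday
idx = pd.date_range("1961-01-01", periods=14, freq="D")
data = pd.DataFrame({"RPT": range(14)}, index=idx)

# 'W' buckets end on Sundays by default; each row is one week's mean
weekly = data.resample("W").mean()
print(weekly)
```

Because 1961-01-01 was a Sunday, the first bucket holds that single day, which is why the notebook slices off row 0 in the next step.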
Step 15. Calculate the min, max and mean windspeeds and standard deviations of the windspeeds across all locations for each week (assume that the first week starts on January 2 1961) for the first 52 weeks.
week = data.resample('W').agg(['min', 'max', 'mean', 'std'])
# keep the 52 full weeks, skipping the partial week that ends 1961-01-01
week.loc[week.index[1:53], "RPT":"MAL"].head(5)
| | RPT | VAL | ROS | ... | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | min | max | mean | std | min | max | mean | std | min | max | ... | mean | std | min | max | mean | std | min | max | mean | std |
| Yr_Mo_Dy | |||||||||||||||||||||
| 1961-01-08 | 10.58 | 18.50 | 13.541429 | 2.631321 | 6.63 | 16.88 | 11.486667 | 3.949525 | 7.62 | 12.33 | ... | 8.497143 | 1.704941 | 5.46 | 17.54 | 12.481429 | 4.349139 | 10.88 | 16.46 | 13.238571 | 1.773062 |
| 1961-01-15 | 9.04 | 19.75 | 12.468571 | 3.555392 | 3.54 | 12.08 | 8.967143 | 3.148945 | 7.08 | 19.50 | ... | 7.571429 | 4.084293 | 5.25 | 20.71 | 11.125714 | 5.552215 | 5.17 | 16.92 | 11.024286 | 4.692355 |
| 1961-01-22 | 4.92 | 19.83 | 13.204286 | 5.337402 | 3.42 | 14.37 | 9.862857 | 3.837785 | 7.29 | 20.79 | ... | 8.124286 | 4.783952 | 6.50 | 15.92 | 9.821429 | 3.626584 | 6.79 | 17.96 | 11.434286 | 4.237239 |
| 1961-01-29 | 13.62 | 25.04 | 19.880000 | 4.619061 | 9.96 | 23.91 | 16.141429 | 5.170224 | 12.67 | 25.84 | ... | 15.640000 | 3.713368 | 14.04 | 27.71 | 20.930000 | 5.210726 | 17.50 | 27.63 | 22.530000 | 3.874721 |
| 1961-02-05 | 10.58 | 24.21 | 16.827143 | 5.251408 | 9.46 | 24.21 | 15.460000 | 5.187395 | 9.04 | 19.70 | ... | 9.460000 | 2.839501 | 9.17 | 19.33 | 14.012857 | 4.210858 | 7.17 | 19.25 | 11.935714 | 4.336104 |
5 rows × 48 columns
Chipotle
Step 1. Import the necessary libraries
import pandas as pd
import numpy as np
Step 2. Import the dataset from this address.
Step 3. Assign it to a variable called chipo.
Step 4. See the first 10 entries
data_url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"
chipo = pd.read_csv(data_url, sep = '\t')
chipo.head(10)
| | order_id | quantity | item_name | choice_description | item_price |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
| 5 | 3 | 1 | Chicken Bowl | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $10.98 |
| 6 | 3 | 1 | Side of Chips | NaN | $1.69 |
| 7 | 4 | 1 | Steak Burrito | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75 |
| 8 | 4 | 1 | Steak Soft Tacos | [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | $9.25 |
| 9 | 5 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25 |
# remove the dollar sign from item_price (regex=False treats "$" as a literal
# character and avoids the FutureWarning about regex defaults)
chipo['item_price'] = chipo['item_price'].str.replace('$', '', regex=False)
chipo.head(5)
| order_id | quantity | item_name | choice_description | item_price | |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | 2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | 3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | 3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | 2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | 16.98 |
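The strip-and-convert clean-up can also be done in one vectorized chain rather than two steps. A sketch on a toy frame (the price strings are invented stand-ins for `chipo`'s `item_price` column):

```python
import pandas as pd

# Toy stand-in for chipo's item_price column (hypothetical values)
toy = pd.DataFrame({"item_price": ["$2.39 ", "$16.98 ", "$3.39 "]})

# Strip the literal "$" (regex=False) and cast to float in one chain;
# astype(float) tolerates the trailing whitespace
toy["item_price"] = toy["item_price"].str.replace("$", "", regex=False).astype(float)
print(toy["item_price"].tolist())  # → [2.39, 16.98, 3.39]
```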
Step 5. What is the number of observations in the dataset?
chipo.shape[0]
4622
Step 6. What is the number of columns in the dataset?
chipo.shape[1]
5
Step 7. Print the name of all the columns.
chipo.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description',
'item_price'],
dtype='object')
Step 8. How is the dataset indexed?
chipo.index
RangeIndex(start=0, stop=4622, step=1)
Step 9. Which was the most-ordered item?
most_ordered = chipo.groupby('item_name')[['order_id', 'quantity']].sum().sort_values('quantity', ascending=False)
most_ordered.head(1)
| order_id | quantity | |
|---|---|---|
| item_name | ||
| Chicken Bowl | 713926 | 761 |
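When only the single winner is needed, `idxmax` on the grouped totals reads it off without sorting the whole frame. A sketch on a toy frame (rows invented):

```python
import pandas as pd

# Toy stand-in for chipo (hypothetical rows)
toy = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Izze", "Chicken Bowl"],
    "quantity": [2, 1, 3],
})

# Sum quantities per item, then pick the label of the largest total
totals = toy.groupby("item_name")["quantity"].sum()
print(totals.idxmax(), totals.max())  # → Chicken Bowl 5
```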
Step 10. For the most-ordered item, how many items were ordered?
#reuse the grouped totals from Step 9; the quantity column holds the answer
most_ordered['quantity'].head(1)
item_name
Chicken Bowl    761
Name: quantity, dtype: int64
Step 11. What was the most ordered item in the choice_description column?
most_ordered = chipo.groupby('choice_description')[['order_id', 'quantity']].sum().sort_values('quantity', ascending=False)
most_ordered.head(1)
| order_id | quantity | |
|---|---|---|
| choice_description | ||
| [Diet Coke] | 123455 | 159 |
Step 12. How many items were ordered in total?
chipo.quantity.sum()
4972
Step 13.
• Turn the item price into a float
• Check the item price type
• Create a lambda function and change the type of item price
• Check the item price type
# item_price already lost its "$" in Step 4, so the lambda just casts to float
# (slicing with x[1:-1] here would chop off a leading digit and the last decimal)
item_price = lambda x: float(x)
chipo.item_price = chipo.item_price.apply(item_price)
chipo.item_price
0        2.39
1        3.39
2        3.39
3        2.39
4       16.98
        ...
4617    11.75
4618    11.75
4619    11.25
4620     8.75
4621     8.75
Name: item_price, Length: 4622, dtype: float64
chipo.item_price.dtype
dtype('float64')
Step 14. How much was the revenue for the period in the dataset?
revenue = (chipo['quantity'] * chipo['item_price']).sum()
revenue
39237.02
Step 15. How many orders were made in the period?
chipo.order_id.value_counts().count()
1834
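`value_counts().count()` works, but `nunique()` states the intent (count distinct order ids) directly. Both agree, as a toy series shows (values invented):

```python
import pandas as pd

order_ids = pd.Series([1, 1, 2, 3, 3, 3])  # hypothetical order_id column

# counting the bins of value_counts() and counting unique values coincide
print(order_ids.value_counts().count(), order_ids.nunique())  # → 3 3
```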
Step 16. What is the average revenue amount per order?
# revenue per line must include quantity; sum per order, then average
chipo['revenue'] = chipo['quantity'] * chipo['item_price']
order_grouped = chipo.groupby('order_id')['revenue'].sum()
order_grouped.mean()
21.394231188658654
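Including the quantity matters here: averaging summed unit prices alone undercounts multi-item lines. A toy two-order frame (values invented) makes the per-order computation concrete:

```python
import pandas as pd

# Hypothetical mini-version of chipo: order 1 has two lines, order 2 has one
toy = pd.DataFrame({
    "order_id":   [1, 1, 2],
    "quantity":   [1, 2, 1],
    "item_price": [2.39, 3.00, 10.00],
})

# line revenue = quantity * unit price; then sum per order and average
toy["revenue"] = toy["quantity"] * toy["item_price"]
per_order = toy.groupby("order_id")["revenue"].sum()
print(per_order.mean())  # (8.39 + 10.00) / 2 = 9.195
```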
Step 17. How many different items are sold?
chipo.item_name.value_counts().count()
50
Create a line plot showing the number of marriages and divorces per capita in the U.S. between 1867 and 2014. Label both lines and show the legend. Don't forget to label your axes!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
marriages_divorces = pd.read_csv('us-marriages-divorces-1867-2014.csv')
#marriages_divorces.info()
marriages_divorces.head()
| Year | Marriages | Divorces | Population | Marriages_per_1000 | Divorces_per_1000 | |
|---|---|---|---|---|---|---|
| 0 | 1867 | 357000.0 | 10000.0 | 36970000 | 9.7 | 0.3 |
| 1 | 1868 | 345000.0 | 10000.0 | 37885000 | 9.1 | 0.3 |
| 2 | 1869 | 348000.0 | 11000.0 | 38870000 | 9.0 | 0.3 |
| 3 | 1870 | 352000.0 | 11000.0 | 39905000 | 8.8 | 0.3 |
| 4 | 1871 | 359000.0 | 12000.0 | 41010000 | 8.8 | 0.3 |
years = marriages_divorces['Year']
marriages = marriages_divorces['Marriages_per_1000']
divorces = marriages_divorces['Divorces_per_1000']
# keep plotting calls separate from the DataFrame; reassigning marriages_divorces
# to each pyplot return value would clobber the data
plt.figure(figsize=(16, 8))
plt.plot(years, marriages, label='Marriages per 1000')
plt.plot(years, divorces, label='Divorces per 1000')
plt.title("Marriages and divorces per 1,000 people in the U.S., 1867-2014")
plt.xlabel("Year", fontsize=14)
plt.ylabel("Rate per 1,000 people", fontsize=14)
plt.legend(fontsize=12, loc="upper left")
plt.grid(True)
plt.show()
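pandas can also draw both labelled lines (and the legend) in a single call. A sketch using the document's own 1900/1950/2000 values as stand-in data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Three rows from the dataset as a toy stand-in for the full frame
df = pd.DataFrame({
    "Year": [1900, 1950, 2000],
    "Marriages_per_1000": [9.3, 11.0, 8.2],
    "Divorces_per_1000": [0.7, 2.5, 3.3],
})

# One call plots both columns against Year and builds the legend
ax = df.plot(x="Year", y=["Marriages_per_1000", "Divorces_per_1000"],
             figsize=(16, 8), grid=True)
ax.set_xlabel("Year")
ax.set_ylabel("Rate per 1,000 people")
```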
Create a vertical bar chart comparing the number of marriages and divorces per capita in the U.S. between 1900, 1950, and 2000. Don't forget to label your axes!
import numpy as np
import pandas as pd
import plotly.graph_objects as go
marriages_divorces1 = pd.read_csv('us-marriages-divorces-1867-2014.csv')
filtered = marriages_divorces1.loc[marriages_divorces1.Year.isin([1900, 1950, 2000]), ['Year', 'Marriages_per_1000', 'Divorces_per_1000']]
filtered
| Year | Marriages_per_1000 | Divorces_per_1000 | |
|---|---|---|---|
| 33 | 1900 | 9.3 | 0.7 |
| 83 | 1950 | 11.0 | 2.5 |
| 133 | 2000 | 8.2 | 3.3 |
# Grouped bar chart using plotly: side-by-side bars compare the two rates per year
fig = go.Figure(data=[
    go.Bar(name='Marriages_per_1000', x=filtered['Year'], y=filtered['Marriages_per_1000']),
    go.Bar(name='Divorces_per_1000', x=filtered['Year'], y=filtered['Divorces_per_1000'])
])
# Set the bar mode
fig.update_layout(barmode='group',
                  title="Marriages and divorces per 1,000 people in the U.S.: 1900, 1950, and 2000",
                  xaxis_title="Year",
                  yaxis_title="Rate per 1,000 people")
fig.show()
Create a horizontal bar chart that compares the deadliest actors in Hollywood. Sort the actors by their kill count and label each bar with the corresponding actor's name. Don't forget to label your axes!
import numpy as np
import pandas as pd
import plotly.graph_objects as go
dead = pd.read_csv('actor_kill_counts.csv')
dead
| Actor | Count | |
|---|---|---|
| 0 | Arnold Schwarzenegger | 369 |
| 1 | Chow Yun-Fat | 295 |
| 2 | Clint Eastwood | 207 |
| 3 | Clive Owen | 194 |
| 4 | Dolph Lundgren | 239 |
| 5 | Jet Li | 201 |
| 6 | Nicolas Cage | 204 |
| 7 | Sylvester Stallone | 267 |
| 8 | Tomisaburo Wakayama | 226 |
| 9 | Wesley Snipes | 193 |
# sort ascending so the deadliest actor ends up at the top of the horizontal chart
dead_sorted = dead.sort_values("Count")
fig = go.Figure(go.Bar(
    x=dead_sorted["Count"],
    y=dead_sorted["Actor"],
    orientation='h'))
fig.update_layout(
    title="Deadliest actors in Hollywood, sorted by on-screen kill count",
    xaxis_title="Kill Count",
    yaxis_title="Actor",
)
fig.show()
Create a pie chart showing the fraction of all Roman Emperors that were assassinated. Make sure that the pie chart is an even circle, labels the categories, and shows the percentage breakdown of the categories.
import pandas as pd
import numpy as np
import plotly.express as px
emperors = pd.read_csv("roman-emperor-reigns.csv")  # avoid `re`, which shadows the stdlib module
assassinated = emperors[emperors["Cause_of_Death"] == "Assassinated"]
assassinated
| Emperor | Length_of_Reign | Cause_of_Death | |
|---|---|---|---|
| 2 | Caligula | 4.83 | Assassinated |
| 5 | Galba | 0.58 | Assassinated |
| 7 | Vitellius | 0.67 | Assassinated |
| 10 | Domitian | 15.00 | Assassinated |
| 17 | Commodus | 15.00 | Assassinated |
| 18 | Pertinax | 0.25 | Assassinated |
| 21 | Caracalla | 19.00 | Assassinated |
| 22 | Geta | 3.00 | Assassinated |
| 24 | Elagabalus | 3.75 | Assassinated |
| 25 | Severus Alexander | 13.00 | Assassinated |
| 26 | Maximinus I | 3.25 | Assassinated |
| 29 | Pupienus | 0.25 | Assassinated |
| 30 | Balbinus | 0.25 | Assassinated |
| 35 | Trebonianus Gallus | 2.00 | Assassinated |
| 36 | Aemilian | 0.16 | Assassinated |
| 38 | Gallienus | 15.00 | Assassinated |
| 41 | Aurelian | 5.00 | Assassinated |
| 43 | Florian | 0.25 | Assassinated |
| 44 | Probus | 6.00 | Assassinated |
| 52 | Severus II | 1.00 | Assassinated |
| 59 | Constans I | 13.00 | Assassinated |
| 65 | Gratian | 16.00 | Assassinated |
# the task asks for the fraction of *all* emperors that were assassinated,
# so the pie needs two slices: assassinated vs. every other cause of death
fig = px.pie(values=[len(assassinated), len(emperors) - len(assassinated)],
             names=["Assassinated", "Other cause of death"])
fig.show()
Create a scatter plot showing the relationship between the total revenue earned by arcades and the number of Computer Science PhDs awarded in the U.S. between 2000 and 2009. Don't forget to label your axes! Color each dot according to its year.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot
phd = pd.read_csv("arcade-revenue-vs-cs-doctorates.csv")
phd
| Year | Total Arcade Revenue (billions) | Computer Science Doctorates Awarded (US) | |
|---|---|---|---|
| 0 | 2000 | 1.196 | 861 |
| 1 | 2001 | 1.176 | 830 |
| 2 | 2002 | 1.269 | 809 |
| 3 | 2003 | 1.240 | 867 |
| 4 | 2004 | 1.307 | 948 |
| 5 | 2005 | 1.435 | 1129 |
| 6 | 2006 | 1.601 | 1453 |
| 7 | 2007 | 1.654 | 1656 |
| 8 | 2008 | 1.803 | 1787 |
| 9 | 2009 | 1.734 | 1611 |
# create the figure before plotting; calling plot.figure() afterwards
# just opens a new, empty figure
plot.figure(figsize=(10, 6))
# one series per year so the legend colours each dot by its year
for name, group in phd.groupby('Year'):
    plot.plot(group['Total Arcade Revenue (billions)'],
              group['Computer Science Doctorates Awarded (US)'],
              marker='o', linestyle='', markersize=12, label=name)
plot.legend(title="Year")
plot.xlabel("Total Arcade Revenue (billions USD)")
plot.ylabel("Computer Science Doctorates Awarded (US)")
plot.show()
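An alternative to one legend entry per year is a single `scatter` call with a continuous colour scale. A sketch using the first three rows of the table above as stand-in data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# First three rows of the table above as a toy stand-in for the full frame
phd = pd.DataFrame({
    "Year": [2000, 2001, 2002],
    "Total Arcade Revenue (billions)": [1.196, 1.176, 1.269],
    "Computer Science Doctorates Awarded (US)": [861, 830, 809],
})

fig, ax = plt.subplots(figsize=(10, 6))
# colour each dot by its year via the c= argument and a colormap
sc = ax.scatter(phd["Total Arcade Revenue (billions)"],
                phd["Computer Science Doctorates Awarded (US)"],
                c=phd["Year"], cmap="viridis", s=120)
fig.colorbar(sc, label="Year")  # continuous colour scale instead of a legend
ax.set_xlabel("Total Arcade Revenue (billions USD)")
ax.set_ylabel("Computer Science Doctorates Awarded (US)")
```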